Goto

Collaborating Authors

 Jigawa State


Sure! Here's a short and concise title for your paper: "Contamination in Generated Text Detection Benchmarks"

Dingfelder, Philipp, Riess, Christian

arXiv.org Artificial Intelligence

Large language models are increasingly used for many applications. To prevent illicit use, it is desirable to be able to detect AI-generated text. Training and evaluation of such detectors critically depend on suitable benchmark datasets. Several groups took on the tedious work of collecting, curating, and publishing large and diverse datasets for this task. However, it remains an open challenge to ensure high quality in all relevant aspects of such a dataset. For example, the DetectRL benchmark exhibits relatively simple patterns of AI-generation in 98.5% of the Claude-LLM data. These patterns may include introductory words such as "Sure! Here is the academic article abstract:", or instances where the LLM rejects the prompted task. In this work, we demonstrate that detectors trained on such data use such patterns as shortcuts, which facilitates spoofing attacks on the trained detectors. We consequently reprocessed the DetectRL dataset with several cleansing operations. Experiments show that such data cleansing makes direct attacks more difficult. The reprocessed dataset is publicly available.


BanglaMedQA and BanglaMMedBench: Evaluating Retrieval-Augmented Generation Strategies for Bangla Biomedical Question Answering

Sultana, Sadia, Muna, Saiyma Sittul, Samarukh, Mosammat Zannatul, Abrar, Ajwad, Chowdhury, Tareque Mohmud

arXiv.org Artificial Intelligence

Developing accurate biomedical Question Answering (QA) systems in low-resource languages remains a major challenge, limiting equitable access to reliable medical knowledge. This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical Multiple Choice Question (MCQ) datasets designed to evaluate reasoning and retrieval in medical artificial intelligence (AI). The study applies and benchmarks several Retrieval-Augmented Generation (RAG) strategies, including Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining textbook-based and web retrieval with generative reasoning to improve factual accuracy. A key novelty lies in integrating a Bangla medical textbook corpus through Optical Character Recognition (OCR) and implementing an Agentic RAG pipeline that dynamically selects between retrieval and reasoning strategies. Experimental results show that the Agentic RAG achieved the highest accuracy 89.54% with openai/gpt-oss-120b, outperforming other configurations and demonstrating superior rationale quality. These findings highlight the potential of RAG-based methods to enhance the reliability and accessibility of Bangla medical QA, establishing a foundation for future research in multilingual medical artificial intelligence.


Dataset Creation and Baseline Models for Sexism Detection in Hausa

Muhammad, Fatima Adam, Hassan, Shamsuddeen Muhammad, Inuwa-Dutse, Isa

arXiv.org Artificial Intelligence

Sexism reinforces gender inequality and social exclusion by perpetuating stereotypes, bias, and discriminatory norms. Noting how online platforms enable various forms of sexism to thrive, there is a growing need for effective sexism detection and mitigation strategies. While computational approaches to sexism detection are widespread in high-resource languages, progress remains limited in low-resource languages where limited linguistic resources and cultural differences affect how sexism is expressed and perceived. This study introduces the first Hausa sexism detection dataset, developed through community engagement, qualitative coding, and data augmentation. For cultural nuances and linguistic representation, we conducted a two-stage user study (n=66) involving native speakers to explore how sexism is defined and articulated in everyday discourse. We further experiment with both traditional machine learning classifiers and pre-trained multilingual language models and evaluating the effectiveness few-shot learning in detecting sexism in Hausa. Our findings highlight challenges in capturing cultural nuance, particularly with clarification-seeking and idiomatic expressions, and reveal a tendency for many false positives in such cases.


Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead

Alabi, Jesujoba O., Hedderich, Michael A., Adelani, David Ifeoluwa, Klakow, Dietrich

arXiv.org Artificial Intelligence

With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors-including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 884 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.


The Rise of AfricaNLP: Contributions, Contributors, and Community Impact (2005-2025)

Belay, Tadesse Destaw, Hussen, Kedir Yassin, Imam, Sukairaj Hafiz, Ahmad, Ibrahim Said, Inuwa-Dutse, Isa, Haile, Abrham Belete, Sidorov, Grigori, Ameer, Iqra, Abdulmumin, Idris, Gwadabe, Tajuddeen, Marivate, Vukosi, Yimam, Seid Muhie, Muhammad, Shamsuddeen Hassan

arXiv.org Artificial Intelligence

Natural Language Processing (NLP) is undergoing constant transformation, as Large Language Models (LLMs) are driving daily breakthroughs in research and practice. In this regard, tracking the progress of NLP research and automatically analyzing the contributions of research papers provides key insights into the nature of the field and the researchers. This study explores the progress of African NLP (AfricaNLP) by asking (and answering) basic research questions such as: i) How has the nature of NLP evolved over the last two decades?, ii) What are the contributions of AfricaNLP papers?, and iii) Which individuals and organizations (authors, affiliated institutions, and funding bodies) have been involved in the development of AfricaNLP? We quantitatively examine the contributions of AfricaNLP research using 1.9K NLP paper abstracts, 4.9K author contributors, and 7.8K human-annotated contribution sentences (AfricaNLPContributions) along with benchmark results. Our dataset and continuously existing NLP progress tracking website provide a powerful lens for tracing AfricaNLP research trends and hold potential for generating data-driven literature surveys.


OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

Inuwa-Dutse, Isa

arXiv.org Artificial Intelligence

In response to the recent safety probing for OpenAI's GPT-OSS-20b model, we present a summary of a set of vulnerabilities uncovered in the model, focusing on its performance and safety alignment in a low-resource language setting. The core motivation for our work is to question the model's reliability for users from underrepresented communities. Using Hausa, a major African language, we uncover biases, inaccuracies, and cultural insensitivities in the model's behaviour. With a minimal prompting, our red-teaming efforts reveal that the model can be induced to generate harmful, culturally insensitive, and factually inaccurate content in the language. As a form of reward hacking, we note how the model's safety protocols appear to relax when prompted with polite or grateful language, leading to outputs that could facilitate misinformation and amplify hate speech. For instance, the model operates on the false assumption that common insecticide locally known as Fiya-Fiya (Cyphermethrin) and rodenticide like Shinkafar Bera (a form of Aluminium Phosphide) are safe for human consumption. To contextualise the severity of this error and popularity of the substances, we conducted a survey (n=61) in which 98% of participants identified them as toxic. Additional failures include an inability to distinguish between raw and processed foods and the incorporation of demeaning cultural proverbs to build inaccurate arguments. We surmise that these issues manifest through a form of linguistic reward hacking, where the model prioritises fluent, plausible-sounding output in the target language over safety and truthfulness. We attribute the uncovered flaws primarily to insufficient safety tuning in low-resource linguistic contexts. By concentrating on a low-resource setting, our approach highlights a significant gap in current red-teaming effort and offer some recommendations.


Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing

Azime, Israel Abebe, Belay, Tadesse Destaw, Tonja, Atnafu Lambebo

arXiv.org Artificial Intelligence

Large Language Models (LLMs) powered with argentic capabilities are able to do knowledge-intensive tasks without human involvement. A prime example of this tool is Deep research with the capability to browse the web, extract information and generate multi-page reports. In this work, we introduce an evaluation sheet that can be used for assessing the capability of Deep Research tools. In addition, we selected academic survey writing as a use case task and evaluated output reports based on the evaluation sheet we introduced. Our findings show the need to have carefully crafted evaluation standards. The evaluation done on OpenAI`s Deep Search and Google's Deep Search in generating an academic survey showed the huge gap between search engines and standalone Deep Research tools, the shortcoming in representing the targeted area.


A Review on Influx of Bio-Inspired Algorithms: Critique and Improvement Needs

Somvanshi, Shriyank, Islam, Md Monzurul, Javed, Syed Aaqib, Chhetri, Gaurab, Islam, Kazi Sifatul, Chowdhury, Tausif Islam, Polock, Sazzad Bin Bashar, Dutta, Anandi, Das, Subasish

arXiv.org Artificial Intelligence

Bio-inspired algorithms utilize natural processes such as evolution, swarm behavior, foraging, and plant growth to solve complex, nonlinear, high-dimensional optimization problems. However, a plethora of these algorithms require a more rigorous review before making them applicable to the relevant fields. This survey categorizes these algorithms into eight groups: evolutionary, swarm intelligence, physics-inspired, ecosystem and plant-based, predator-prey, neural-inspired, human-inspired, and hybrid approaches, and reviews their principles, strengths, novelty, and critical limitations. We provide a critique on the novelty issues of many of these algorithms. We illustrate some of the suitable usage of the prominent algorithms in machine learning, engineering design, bioinformatics, and intelligent systems, and highlight recent advances in hybridization, parameter tuning, and adaptive strategies. Finally, we identify open challenges such as scalability, convergence, reliability, and interpretability to suggest directions for future research. This work aims to serve as a resource for both researchers and practitioners interested in understanding the current landscape and future directions of reliable and authentic advancement of bio-inspired algorithms.


Signals of Provenance: Practices & Challenges of Navigating Indicators in AI-Generated Media for Sighted and Blind Individuals

Ide, Ayae, Park, Tory, Mink, Jaron, Sharma, Tanusree

arXiv.org Artificial Intelligence

AI-Generated (AIG) content has become increasingly widespread by recent advances in generative models and the easy-to-use tools that have significantly lowered the technical barriers for producing highly realistic audio, images, and videos through simple natural language prompts. In response, platforms are adopting provable provenance with platforms recommending AIG to be self-disclosed and signaled to users. However, these indicators may be often missed, especially when they rely solely on visual cues and make them ineffective to users with different sensory abilities. To address the gap, we conducted semi-structured interviews (N=28) with 15 sighted and 13 BLV participants to examine their interaction with AIG content through self-disclosed AI indicators. Our findings reveal diverse mental models and practices, highlighting different strengths and weaknesses of content-based (e.g., title, description) and menu-aided (e.g., AI labels) indicators. While sighted participants leveraged visual and audio cues, BLV participants primarily relied on audio and existing assistive tools, limiting their ability to identify AIG. Across both groups, they frequently overlooked menu-aided indicators deployed by platforms and rather interacted with content-based indicators such as title and comments. We uncovered usability challenges stemming from inconsistent indicator placement, unclear metadata, and cognitive overload. These issues were especially critical for BLV individuals due to the insufficient accessibility of interface elements. We provide practical recommendations and design implications for future AIG indicators across several dimensions.


Who Wrote This? Identifying Machine vs Human-Generated Text in Hausa

Sani, Babangida, Soy, Aakansha, Imam, Sukairaj Hafiz, Mustapha, Ahmad, Aliyu, Lukman Jibril, Abdulmumin, Idris, Ahmad, Ibrahim Said, Muhammad, Shamsuddeen Hassan

arXiv.org Artificial Intelligence

The advancement of large language models (LLMs) has allowed them to be proficient in various tasks, including content generation. However, their unregulated usage can lead to malicious activities such as plagiarism and generating and spreading fake news, especially for low-resource languages. Most existing machine-generated text detectors are trained on high-resource languages like English, French, etc. In this study, we developed the first large-scale detector that can distinguish between human- and machine-generated content in Hausa. We scrapped seven Hausa-language media outlets for the human-generated text and the Gemini-2.0 flash model to automatically generate the corresponding Hausa-language articles based on the human-generated article headlines. We fine-tuned four pre-trained Afri-centric models (AfriTeVa, AfriBERTa, AfroXLMR, and AfroXLMR-76L) on the resulting dataset and assessed their performance using accuracy and F1-score metrics. AfroXLMR achieved the highest performance with an accuracy of 99.23% and an F1 score of 99.21%, demonstrating its effectiveness for Hausa text detection. Our dataset is made publicly available to enable further research.